Homework 1
Load Packages
library(Hmisc)
library(tidyverse)
Problem 1
Survey
Time completed survey: Thursday August 29th, 9:21 pm.
Campuswire
Insert the image you uploaded to Campuswire here.
Problem 2
Question 1
The study population for both data set one and data set two is all people who have experienced crime in Britain.
Question 2
The sampling strategy for data set one is voluntary response; the sampling strategy for data set two is a retrospective study.
Question 3
The sampled population of data set one is 38,000 people living in England and Wales who are 16 years or older and do not live in communal living arrangements.
The sample for data set two is the criminal records held by UK police for crimes that have been investigated.
Question 4
The target population of data set one is people aged 16 and older who do not live in communal living arrangements. The target population of data set two is people who appear in UK police records.
Question 5
The first data set relies on self-reported data, which is not fully reliable because self-reporting introduces bias from the people responding. The first data set is also not valid for the entire population: it starts at 16 years of age and excludes people in communal living, so it does not represent the whole British population. The goal was to represent the British population, but the sample covers only people aged 16 and older who do not live in communal living arrangements.
The second data set is reliable because it comes from UK police criminal records. It is accurate, but it is not representative of the UK population, which makes the study not valid. The sampled population does not represent the study population because it draws only on UK police criminal records, not the whole population. The conclusions from this study cannot be applied to the target population because the data represent only records of crimes, not the population as a whole.
Problem 3
Question 1
The <- notation is equivalent to an = sign in R and is often used to declare variables. After running this code chunk, the named dataframe df appears in the environment on the right-hand side of RStudio.
df <- read_csv('https://www.openintro.org/data/csv/babies.csv')
Rows: 1236 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): case, bwt, gestation, parity, age, height, weight, smoke
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
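The note on `<-` and `=` in Question 1 can be illustrated with a minimal sketch in base R. The vectors `x` and `y` are hypothetical names chosen for this example. The two operators behave the same for top-level assignment, but inside a function call `=` names an argument instead of assigning a variable.

```r
# At the top level, both operators assign a value to a name.
# `<-` is the conventional style in R code.
x <- c(1, 2, 3)
y = c(1, 2, 3)
same_value <- identical(x, y)

# Inside a function call, `=` matches an argument by name.
# This computes the mean of c(4, 5, 6) and leaves the global x unchanged.
arg_mean <- mean(x = c(4, 5, 6))

print(same_value)
print(x)
```

Because of this difference, using `<-` for assignment and reserving `=` for function arguments keeps the two roles visually distinct.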
Question 2
The notation Hmisc:: directly calls this function from the Hmisc package. describe() is a common function name, and sometimes this is needed to indicate to R which function from which package you want to use. The pipe feature |> sends the results of the first line directly into the function on the 2nd line and is a convenient way to chain functions together.
This code prints a useful and attractive summary of the data set we are using.
Hmisc::describe(df) |>
  html()
8 Variables 1236 Observations
case
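| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1236 | 0 | 1236 | 1 | 618.5 | 412.3 | 62.75 | 124.50 | 309.75 | 618.50 | 927.25 | 1112.50 | 1174.25 |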
lowest : 1 2 3 4 5 , highest: 1232 1233 1234 1235 1236
bwt
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1236 | 0 | 107 | 1 | 119.6 | 20.33 | 88.0 | 97.0 | 108.8 | 120.0 | 131.0 | 142.0 | 149.0 |
gestation
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1223 | 13 | 106 | 0.999 | 279.3 | 16.57 | 252.0 | 262.0 | 272.0 | 280.0 | 288.0 | 295.8 | 302.0 |
parity
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 1236 | 0 | 2 | 0.57 | 315 | 0.2549 | 0.3801 |
age
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1234 | 2 | 30 | 0.997 | 27.26 | 6.506 | 19 | 20 | 23 | 26 | 31 | 36 | 38 |
height
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1214 | 22 | 19 | 0.986 | 64.05 | 2.839 | 60 | 61 | 62 | 64 | 66 | 67 | 68 |
Value 53 54 56 57 58 59 60 61 62 63 64 65
Frequency 1 1 1 1 10 26 55 105 131 166 183 182
Proportion 0.001 0.001 0.001 0.001 0.008 0.021 0.045 0.086 0.108 0.137 0.151 0.150
Value 66 67 68 69 70 71 72
Frequency 153 105 54 20 13 6 1
Proportion 0.126 0.086 0.044 0.016 0.011 0.005 0.001
For the frequency table, variable is rounded to the nearest 0
weight
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1200 | 36 | 105 | 0.999 | 128.6 | 22.39 | 102.0 | 105.0 | 114.8 | 125.0 | 139.0 | 155.0 | 170.0 |
smoke
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 1226 | 10 | 2 | 0.717 | 484 | 0.3948 | 0.4782 |
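The `::` notation and the `|>` pipe described in Question 2 can be tried on a small vector without loading any extra packages. This is a minimal sketch that assumes R 4.1 or later (needed for the native pipe). The vector `values` is hypothetical and stands in for a real column.

```r
# Hypothetical values used only for illustration
values <- c(4, 8, 15, 16, 23, 42)

# pkg::fun calls a function from a specific package by its full name
m1 <- stats::median(values)

# x |> f() is evaluated as f(x), so this gives the same result as m1
m2 <- values |> stats::median()

# Pipes can be chained: sort in decreasing order, then keep the first three
top3 <- values |> sort(decreasing = TRUE) |> head(3)

print(m1)
print(top3)
```

Reading a chain left to right ("take `values`, sort it, then take the head") is usually easier than reading the equivalent nested call `head(sort(values, decreasing = TRUE), 3)` from the inside out.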
Question 3
The Child Health and Development Studies investigate a range of topics. One study, in particular, considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The variables in this data set are as follows.
| Variable Name | Variable Description | Variable Type |
|---|---|---|
| case | id number | numerical, discrete |
| bwt | birthweight, in ounces | numerical, continuous |
| gestation | length of gestation, in days | numerical, discrete |
| parity | binary indicator for a first pregnancy (0 = first pregnancy) | numerical, nominal |
| age | mother’s age in years | numerical, discrete |
| height | mother’s height in inches | numerical, continuous |
| weight | mother’s weight in pounds | numerical, continuous |
| smoke | binary indicator for whether the mother smokes | numerical, nominal |
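Binary indicators such as `parity` and `smoke` are stored as the numbers 0 and 1 but describe categories. The sketch below shows one way to handle such a variable in base R. It uses a hypothetical toy vector, not the real data, and the labels are an assumption: the variable description does not say which value of `smoke` means the mother smokes, so coding 1 as "smoker" is a guess made only for illustration.

```r
# Hypothetical 0/1 indicator with one missing value (not the real data)
smoke <- c(0, 1, NA, 1, 0)

# The mean of a 0/1 variable is the proportion of 1s among non-missing values
prop_ones <- mean(smoke, na.rm = TRUE)

# Convert to a factor so R treats the codes as categories, not quantities.
# Assumption: 1 = smoker, 0 = non-smoker.
smoke_f <- factor(smoke, levels = c(0, 1), labels = c("non-smoker", "smoker"))

# Count each category, including missing values
counts <- table(smoke_f, useNA = "ifany")
print(counts)
```

The same idea explains the summary output for `smoke`: its Mean (0.3948) is its Sum divided by the number of non-missing observations (484 / 1226).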
Question 4
Below, two numeric variables are investigated for a potential relationship. The independent (explanatory) variable I chose is weight, and the dependent (response) variable I chose is gestation.
df |>
  ggplot(aes(x = weight,
             y = gestation)) +
  geom_point() +
  ggtitle('The Effect of Weight on Length of Gestation')
Warning: Removed 48 rows containing missing values or values outside the scale range
(`geom_point()`).
There is not much of a correlation between gestation and weight. Much of the data is clustered toward the low-weight end of the graph. One could suggest that gestation is slightly shorter at higher weights and longer at lower weights, but the correlation is neither strong nor evident in the graph.
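The visual impression from the scatterplot could be checked numerically with a correlation coefficient. The sketch below uses a small hypothetical data frame, `toy`, with made-up values rather than the babies data. The same calls should work on `df` with its `weight` and `gestation` columns once the data is loaded.

```r
# Hypothetical toy data standing in for the weight and gestation columns.
# These values are illustrative only and are not taken from the babies data.
toy <- data.frame(
  weight    = c(100, 110, 120, NA, 140, 150),
  gestation = c(280, 275, NA, 290, 270, 285)
)

# Rows with a missing value in either column cannot be plotted or correlated
n_removed <- sum(!complete.cases(toy[, c("weight", "gestation")]))

# Pearson correlation using only rows where both values are present
r <- cor(toy$weight, toy$gestation, use = "complete.obs")

print(n_removed)
print(r)
```

The row count also helps explain the plot warning. The summary reports 13 missing values for gestation and 36 for weight, which would total 49, so the 48 removed rows are consistent with one row missing both values.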
Session Info
xfun::session_info()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.5.1
Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
Package version:
askpass_1.2.0 backports_1.5.0 base64enc_0.1-3
bit_4.0.5 bit64_4.0.5 blob_1.2.4
broom_1.0.6 bslib_0.8.0 cachem_1.1.0
callr_3.7.6 cellranger_1.1.0 checkmate_2.3.2
cli_3.6.3 clipr_0.8.0 cluster_2.1.6
colorspace_2.1-1 compiler_4.4.1 conflicted_1.2.0
cpp11_0.4.7 crayon_1.5.3 curl_5.2.1
data.table_1.15.4 DBI_1.2.3 dbplyr_2.5.0
digest_0.6.37 dplyr_1.1.4 dtplyr_1.3.1
evaluate_0.24.0 fansi_1.0.6 farver_2.1.2
fastmap_1.2.0 fontawesome_0.5.2 forcats_1.0.0
foreign_0.8-86 Formula_1.2-5 fs_1.6.4
gargle_1.5.2 generics_0.1.3 ggplot2_3.5.1
glue_1.7.0 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.4.1 grDevices_4.4.1 grid_4.4.1
gridExtra_2.3 gtable_0.3.5 haven_2.5.4
highr_0.11 Hmisc_5.1-3 hms_1.1.3
htmlTable_2.4.3 htmltools_0.5.8.1 htmlwidgets_1.6.4
httr_1.4.7 ids_1.0.1 isoband_0.2.7
jquerylib_0.1.4 jsonlite_1.8.8 knitr_1.48
labeling_0.4.3 lattice_0.22.6 lifecycle_1.0.4
lubridate_1.9.3 magrittr_2.0.3 MASS_7.3.60.2
Matrix_1.7.0 memoise_2.0.1 methods_4.4.1
mgcv_1.9.1 mime_0.12 modelr_0.1.11
munsell_0.5.1 nlme_3.1.164 nnet_7.3-19
openssl_2.2.1 parallel_4.4.1 pillar_1.9.0
pkgconfig_2.0.3 prettyunits_1.2.0 processx_3.8.4
progress_1.2.3 ps_1.7.7 purrr_1.0.2
R6_2.5.1 ragg_1.3.2 rappdirs_0.3.3
RColorBrewer_1.1.3 readr_2.1.5 readxl_1.4.3
rematch_2.0.0 rematch2_2.1.2 reprex_2.1.1
rlang_1.1.4 rmarkdown_2.28 rpart_4.1.23
rstudioapi_0.16.0 rvest_1.0.4 sass_0.4.9
scales_1.3.0 selectr_0.4.2 splines_4.4.1
stats_4.4.1 stringi_1.8.4 stringr_1.5.1
sys_3.4.2 systemfonts_1.1.0 textshaping_0.4.0
tibble_3.2.1 tidyr_1.3.1 tidyselect_1.2.1
tidyverse_2.0.0 timechange_0.3.0 tinytex_0.52
tools_4.4.1 tzdb_0.4.0 utf8_1.2.4
utils_4.4.1 uuid_1.2.1 vctrs_0.6.5
viridis_0.6.5 viridisLite_0.4.2 vroom_1.6.5
withr_3.0.1 xfun_0.47 xml2_1.3.6
yaml_2.3.10